Learn the five number summary and compare to a box plot.
Note on Accessibility:
Data visualization is the production of charts and graphs that reveal trends and interrelationships to the eye. Although it can be done well by and for low-vision and colorblind users, dataviz is fundamentally visual, so it’s problematic for the blind. Data methods by and for the blind should strongly consider other techniques. Nevertheless, it’s important for both blind and sighted students of statistics to understand the larger discipline. I’ve tried to make these notes useful to everyone, understanding that a histogram might not be.
The Scatter Plot
A scatterplot is a visualization of two quantitative variables, appropriate when at least two numerical values are recorded about each data case. The \(x\) axis spans the range of one variable, and the \(y\) axis the range of the other. Each data case is then drawn as one point \((x,y)\) according to its values for the two variables. A cloud of points arises, and its shape is a clue to any association between the variables.
The Scatter Plot – Example
import pandas as pd    # needed for read_csv
import seaborn as sns
df = pd.read_csv("county.csv")
sns.scatterplot(data=df, x='homeownership', y='multi_unit')
Figure 1: Scatterplot Comparing homeownership to multi-unit dwelling by U.S. county
Of the three, which species is easiest to identify? How is it recognized?
What’s the best way to distinguish between the other two species?
Do flowers with wider petals usually have wider sepals too?
There are fewer dots on this scatterplot. What does that mean about the flower data?
Figure 3: Scatterplot of petal width vs. sepal width for Fisher’s irises.
What’s a Box (-and-whisker) Plot?
A visualization of the distribution of one quantitative variable, an alternative to a histogram.
reveals the full range of data values along the \(x\)-axis.
divides the data range into four “equal” parts, called quartiles
the quartiles usually have unequal width, but
each quartile’s size is adjusted to include exactly a quarter of the data points.
the inner two quartiles are drawn as a box, and the outer two are drawn as “whiskers”
The five quartile boundary points are called Min, 25%, 50%, 75%, and Max; together they form the five-number summary.
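The five boundary points can also be computed directly, which helps when checking what a box plot shows. The following is a minimal sketch using only Python's standard library. The function name `five_number_summary` and the sample values are made up for illustration, and the "inclusive" quartile method is an assumption (other conventions give slightly different quartiles on small data sets).

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    data = sorted(values)
    # n=4 cuts the data into quarters; "inclusive" interpolates linearly
    # between data points and treats min/max as the 0% and 100% points.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return (data[0], q1, median, q3, data[-1])

# Hypothetical sample values, not taken from any dataset in these notes.
sample = [2, 4, 4, 5, 7, 8, 9, 11, 12, 20, 40]
print(five_number_summary(sample))
```

With a pandas dataframe, `df['column'].describe()` reports these boundaries under the labels min, 25%, 50%, 75%, and max.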
Box (-and-whisker) plot – Basic Example
sns.boxplot(data=df, x='homeownership')
Based on the boxplot, what is a typical homeownership percent for U.S. counties?
For what x-regions are the data points tightly clustered? Where are they more thinly spread?
Figure 4: Boxplot for homeownership by county.
Box (-and-whisker) plot – Rich Example
Since a boxplot is so narrow, several can be stacked together to compare related distributions across categories:
sns.boxplot(data=df, x='Attack', y='Type')
Figure 5: Boxplot for Attack value of various pokemon, separated by type.
The Box Plot – more serious example
sns.boxplot(data=df, x='', y='')
Compared to “unemployed”, “employed” people tend to be… younger? older? More numerous? Farther right?
What is the median age for “employed” and “not in labor force”? Speculate why.
If there are relatively few “unemployed” people in the data, how could you know from the box plot?
What five numbers can be read from the boxplot? Infer these numbers from the topmost plot.
How can you read the IQR from the boxplot? How long are the “whiskers” allowed to be?
Figure 6: Boxplot for homeownership by county.
Whisker Technicalities
Customarily, whiskers aren’t allowed to be more than 1.5 times as long as the box. If a boxplot would otherwise be drawn with longer whiskers, trim them to 1.5 * [box size], and represent data beyond this length as individual dots. Both Python (seaborn) and OpenIntro do this. You need to know this to answer questions like “what’s the maximum data value?” using a boxplot, because the largest value may be a dot beyond the whisker rather than the whisker’s tip.
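The trimming rule can be sketched in a few lines. This is an illustrative calculation, not the code seaborn itself runs: the function name `whisker_ends` and the sample values are hypothetical, and the quartiles are assumed to use the standard library's "inclusive" method. It also assumes the common convention that each whisker stops at the most extreme data value lying within 1.5 box lengths of the box edge, with anything farther out listed as a separate dot.

```python
from statistics import quantiles

def whisker_ends(values, factor=1.5):
    """Return (low_whisker, high_whisker, dots) under the 1.5 * box-size rule."""
    data = sorted(values)
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    box_size = q3 - q1  # the interquartile range (IQR)
    low_limit = q1 - factor * box_size
    high_limit = q3 + factor * box_size
    # Values inside the limits; each whisker ends at the most extreme of these.
    inside = [v for v in data if low_limit <= v <= high_limit]
    # Values outside the limits are drawn as individual dots.
    dots = [v for v in data if v < low_limit or v > high_limit]
    return (inside[0], inside[-1], dots)

# Hypothetical sample values, not taken from any dataset in these notes.
sample = [2, 4, 4, 5, 7, 8, 9, 11, 12, 20, 40]
low, high, dots = whisker_ends(sample)
print("whiskers end at", low, "and", high)
print("drawn as dots:", dots)
print("true maximum:", max([high] + dots))
```

In this sample the upper whisker ends at 20, but the maximum data value is the dot at 40. In seaborn, the `whis` parameter of `sns.boxplot` sets the factor and defaults to 1.5.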
The Box Plot – creation
Once a dataframe is loaded, it’s not hard to make a boxplot:
import pandas as pd    # Needed once, not for every plot
import seaborn as sns  # Needed once, not for every plot
df = pd.read_csv("filename.csv")
sns.boxplot(data=df, x='Attack', y='Type')
See examples illustrated on previous slides.
Summary: Which plot type is best for each case?
To illustrate how time spent studying relates to course grade.
To show any relationship between religious identity and GPA.
To illustrate a link between height and weight.
To show the distribution of molecule sizes in a polymer.
To illustrate total sales by product type.
To visualize the masses and temperatures of thousands of stars.
To show whether pokemon with higher attack also have higher defense.
Load the OpenIntro Run17 data. Which plot type best addresses each question?
What genders are represented and how many of each?
How are the run times distributed?
Are the run times unimodal? Bimodal? Something else?